Guideline: Guidelines On Proactive Problem Analysis Techniques

Overview

Proactive Problem Management activities are mostly ongoing activities aimed at improving the overall availability of services and thereby increasing end user satisfaction. The outcomes of proactive problem analysis generally trigger an improvement initiative.

A few examples of proactive problem analysis include:

Pattern analysis of the incident records
Pattern analysis of maintenance records and operational logs
Periodic reviews of major incidents
Trend analysis of warning and exceptional events
Review of service and operational data for any quality issues, etc.
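
As a minimal sketch of the first item, pattern analysis of incident records can start with a simple frequency count. The record fields (service, category) and the threshold used here are assumptions for illustration and do not come from any specific ticketing tool.

```python
from collections import Counter

def recurring_patterns(incidents, min_count=3):
    """Return ((service, category), count) pairs seen at least min_count times."""
    counts = Counter((i["service"], i["category"]) for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]

# Hypothetical incident records; real data would come from the ticketing tool.
incidents = [
    {"service": "payroll", "category": "batch failure"},
    {"service": "payroll", "category": "batch failure"},
    {"service": "portal", "category": "login error"},
    {"service": "payroll", "category": "batch failure"},
    {"service": "portal", "category": "slow response"},
]

candidates = recurring_patterns(incidents)
```

Each pair returned is a candidate for a problem record, since it represents the same kind of incident recurring on the same service.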

Proactive Problem Analysis can be done by employing preventive and perfective maintenance techniques. A few of these techniques are listed below.

Preventive Maintenance Techniques For Proactive Problem Management

Preventive maintenance is the systematic inspection, detection, correction and prevention of any emerging problems before they become actual problems. A few methods of preventive maintenance include:

Failure Mode and Effects Analysis (FMEA)

FMEA is a systematic, proactive method for evaluating a process to identify where and how it might fail and to assess the relative impact of different failures, in order to identify the parts of the process that are most in need of change. The Risk Priority Number (RPN) is a numeric assessment of risk assigned to a process, or to steps in a process, as part of FMEA: a team assigns each failure mode numeric values that quantify likelihood of occurrence, likelihood of detection, and severity of impact. The main steps in performing an FMEA are:

Step 1: Identify potential failures and effects
Step 2: Determine severity
Step 3: Gauge the likelihood of occurrence
Step 4: Assess the likelihood of detection
Step 5: Calculate the Risk Priority Number (RPN).

RPN = Severity * Occurrence * Detection

RPN should be calculated for the entire design and/or process and documented in the FMEA. The results should reveal the most problematic areas, and the failure modes with the highest RPNs should get the highest priority when corrective actions are implemented.
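
The calculation and ranking can be sketched as follows. The 1-10 rating scales and the sample failure modes are assumptions made for illustration; the guideline itself does not prescribe a scale.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int    # assumed 1-10 scale; 10 = most severe impact
    occurrence: int  # assumed 1-10 scale; 10 = most likely to occur
    detection: int   # assumed 1-10 scale; 10 = least likely to be detected

    @property
    def rpn(self) -> int:
        # RPN = Severity * Occurrence * Detection
        return self.severity * self.occurrence * self.detection

def prioritize(modes):
    """Sort failure modes so that the highest RPN comes first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)

# Hypothetical failure modes for a nightly batch process.
modes = [
    FailureMode("Input file arrives late", severity=6, occurrence=5, detection=2),
    FailureMode("Disk fills during load", severity=8, occurrence=3, detection=4),
    FailureMode("Duplicate records loaded", severity=7, occurrence=2, detection=8),
]

ranked = prioritize(modes)
```

With these sample ratings the RPNs are 60, 96 and 112, so "Duplicate records loaded" is ranked first even though it is the least likely to occur, because it is both severe and hard to detect.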

Automated Health Check (AHC)

An Automated Health Check provides a fast and accurate way to proactively detect and pinpoint the presence and cause of problems that could impact the productivity of WLAN users, before those users report them. Implementing automated health checks for the respective technologies means incorporating proactive monitoring probes into job logs or web server logs; the probes generate alerts so that remedial action can be initiated before high severity issues occur. AHC reduces the costs associated with lost user productivity and with troubleshooting complex wireless problems.
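
A monitoring probe over web server logs might look like the sketch below. It assumes access-log lines in the common/combined layout, where the HTTP status code follows the quoted request line; the function name and the 5% threshold are hypothetical choices, not part of any particular monitoring product.

```python
import re

# Assumed layout: ... "GET /path HTTP/1.1" 200 512
# The status code is the three digits right after the closing quote.
STATUS_RE = re.compile(r'"\s(\d{3})\s')

def check_error_rate(log_lines, threshold=0.05):
    """Return an alert string if the share of 5xx responses exceeds threshold, else None."""
    total = errors = 0
    for line in log_lines:
        match = STATUS_RE.search(line)
        if not match:
            continue  # skip lines that do not look like access-log entries
        total += 1
        if match.group(1).startswith("5"):
            errors += 1
    if total and errors / total > threshold:
        return f"ALERT: {errors}/{total} requests returned 5xx"
    return None

# Hypothetical log sample.
sample = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET /api HTTP/1.1" 500 0',
    '10.0.0.3 - - [01/Jan/2024:10:00:02 +0000] "GET /api HTTP/1.1" 503 0',
    '10.0.0.4 - - [01/Jan/2024:10:00:03 +0000] "GET /home HTTP/1.1" 200 512',
]

alert = check_error_rate(sample)
```

In practice such a probe would be scheduled to run at a fixed interval and its alert routed to the support team, so that a rising error rate is acted on before users raise incidents.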

Pre-Processors

Implementing pre-processors on critical data feeds going into central systems is one of the proactive problem management methods. A proactive scan of these data feeds for data corruption, invalid references, and missing data prevents a considerable number of high severity incidents.
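
The three checks named above can be sketched as a small validation pass that runs before the feed is loaded. The field names (order_id, customer_id, amount) and the reference set are hypothetical; a real pre-processor would use the feed's actual schema.

```python
def validate_feed(records, known_customer_ids,
                  required=("order_id", "customer_id", "amount")):
    """Split feed records into accepted rows and rejected (row, reason) pairs."""
    accepted, rejected = [], []
    for row in records:
        # Missing data: a required field is absent or empty.
        missing = [f for f in required if row.get(f) in (None, "")]
        if missing:
            rejected.append((row, f"missing data: {', '.join(missing)}"))
            continue
        # Invalid reference: the customer is unknown to the central system.
        if row["customer_id"] not in known_customer_ids:
            rejected.append((row, "invalid reference: customer_id"))
            continue
        # Data corruption: the amount cannot be parsed as a number.
        try:
            float(row["amount"])
        except (TypeError, ValueError):
            rejected.append((row, "corrupt value: amount"))
            continue
        accepted.append(row)
    return accepted, rejected

# Hypothetical feed; the last amount contains the letter O instead of a zero.
feed = [
    {"order_id": "1", "customer_id": "C1", "amount": "10.50"},
    {"order_id": "2", "customer_id": "C9", "amount": "5.00"},
    {"order_id": "3", "customer_id": "C1", "amount": ""},
    {"order_id": "4", "customer_id": "C2", "amount": "1O.00"},
]

accepted, rejected = validate_feed(feed, known_customer_ids={"C1", "C2"})
```

Rejected rows are kept together with a reason so that they can be returned to the feed owner instead of failing inside the central system.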

Self Help

Implementing ‘self-help’ in the form of user guides and FAQ documents for online applications and functionality can significantly reduce the number of user queries and RFI-type service requests.

Automated Archiving

Implementing automated deletion / archiving prevents the loss of application availability due to database/file space congestion issues.
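
For file space, one possible approach is a scheduled job that moves files older than a retention period into an archive location. The sketch below is an assumption about how such a job could look; the function name, directory arguments and 30-day default are hypothetical.

```python
import shutil
import time
from pathlib import Path

def archive_old_files(source_dir, archive_dir, max_age_days=30, now=None):
    """Move files older than max_age_days from source_dir to archive_dir.

    Returns the names of the files that were moved.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(Path(source_dir).iterdir()):
        # Only plain files are considered; subdirectories are left alone.
        if path.is_file() and path.stat().st_mtime < cutoff:
            shutil.move(str(path), str(archive / path.name))
            moved.append(path.name)
    return moved
```

A call such as archive_old_files("logs", "logs_archive", max_age_days=90) would typically be run from a scheduler. The same idea applies to database tables, where aged rows are copied to an archive table and then deleted.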

Perfective Maintenance Techniques For Proactive Problem Management

Perfective maintenance is the modification of a software product after delivery to improve its performance or maintainability.

Cycle Time Reduction

Cycle time reduction is the strategy of lowering the time it takes to perform a process in order to improve productivity. In addition, cycle time reduction often improves quality.
Cycle time for critical batch jobs can be reduced through:

Automation
Tuning
Multi-threading
Regular database maintenance
Performing multiple activities in parallel
Re-sequencing.
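
As an illustration of performing multiple activities in parallel, the sketch below runs independent batch steps concurrently. The step function is a stand-in that only sleeps, and the step names are hypothetical. The approach assumes that the steps do not depend on each other and are I/O-bound; CPU-bound work in Python would normally need processes rather than threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_step(name, seconds=0.2):
    """Stand-in for an independent, I/O-bound batch step (hypothetical)."""
    time.sleep(seconds)
    return name

def run_sequential(steps):
    return [run_step(s) for s in steps]

def run_parallel(steps, workers=4):
    # pool.map returns results in the same order as the input steps.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_step, steps))

def timed(func, steps):
    """Run func(steps) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(steps)
    return result, time.perf_counter() - start

steps = ["extract_a", "extract_b", "extract_c", "extract_d"]
seq_result, seq_time = timed(run_sequential, steps)
par_result, par_time = timed(run_parallel, steps)
```

Both runs produce the same results, but the parallel run finishes in roughly the time of the slowest single step rather than the sum of all steps.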

Application Renovation

Application Renovation involves the delivery team brainstorming current issues and improvement ideas under the following headings:

Automation
Availability
Maintenance
Performance Optimization
Scalability
Stability
Robustness
Volatility
Usability
Vulnerability.

Left Shift

Left shift is one of the core components of a preventive maintenance strategy. It is a deliberate approach adopted with the intent to:

Reduce incident inflow for L2 and L3 support teams
Reduce ticket backlog
Improve turnaround time (TAT)
Achieve higher customer satisfaction scores through faster responses
Provide a cost benefit to the client, since first line support costs less than second line support.

Some of the areas where Left Shift can be applied include:

Tickets that are mainly resolved by Service Desk
Tickets that involve a lot of manual effort
Recurring tickets such as password resets, user ID creation, etc.
Authorization and access management related tickets
Routine health checks & batch job failure tickets
Ad-hoc report generations
General administration and data change tickets
Tickets which are resolved by educating the user.

Use of Memory Debugging Tools

A memory debugger, also known as a runtime debugger, is used for finding software memory problems such as memory leaks and buffer overflows. These problems are caused by bugs in the allocation and deallocation of dynamic memory. Sophisticated memory debugging tools such as IBM Rational Purify help to identify and fix memory leaks periodically in transaction intensive systems.

Memory debuggers can help programmers to avoid software anomalies that would exhaust the computer system memory, thus ensuring high reliability of the software even for long runtimes.
Finding memory issues such as leaks can be extremely time consuming. Using a tool to detect memory misuse makes the process much faster and easier.

A few memory debugging tools on the market include dmalloc (any OS), IBM Rational Purify (UNIX and Windows OS), TotalView (UNIX, Mac OS X), WinDbg (Windows OS), Daikon (UNIX, Windows, Mac OS X), etc.
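
The tools listed above target native applications. To illustrate the same idea in a self-contained way, the sketch below uses Python's standard tracemalloc module instead, which is a substitution made here for illustration and not one of the tools named in this guideline. The leaking function and its module-level list are hypothetical.

```python
import tracemalloc

_retained = []  # hypothetical module-level list that is never cleared (the leak)

def leaky_transaction(payload_size=1024):
    """Simulate a transaction that accidentally keeps its payload alive."""
    _retained.append(bytearray(payload_size))

def clean_transaction(payload_size=1024):
    """Simulate a transaction whose payload is released on return."""
    bytearray(payload_size)

def measure_growth(func, iterations=1000):
    """Return (net traced bytes retained, top 3 growth sites) after repeated calls."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(iterations):
        func()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Compare allocations per source line between the two snapshots.
    diffs = after.compare_to(before, "lineno")
    return sum(d.size_diff for d in diffs), diffs[:3]

leak_bytes, leak_sites = measure_growth(leaky_transaction)
clean_bytes, _ = measure_growth(clean_transaction)
```

Memory that keeps growing across repeated transactions, as in the leaky case, is the signal to investigate; the top growth sites point to the source lines responsible for the retained allocations.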

Removal of Bottlenecks

A bottleneck in a process occurs when input comes in faster than the next step can use it to create output. Identifying and fixing bottlenecks is very important to reduce the problems related to customer dissatisfaction, resource waste, high cost, high effort, insufficient resources, poor quality services, etc. RCA techniques shall be applied to avoid bottlenecks.
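
The definition above can be turned into a simple check: walk through the stages in order and flag every stage whose incoming rate exceeds its capacity. The stage names and rates below are hypothetical, and the model assumes steady rates expressed in items per hour.

```python
def find_bottlenecks(arrival_rate, stages):
    """Return {stage_name: backlog_growth_per_hour} for stages that cannot keep up.

    arrival_rate: work items per hour entering the first stage
    stages: ordered list of (stage_name, capacity_per_hour) pairs
    """
    backlog = {}
    incoming = arrival_rate
    for name, capacity in stages:
        if incoming > capacity:
            # Input arrives faster than this stage can turn it into output.
            backlog[name] = incoming - capacity
        # A stage cannot pass on more than it is able to process.
        incoming = min(incoming, capacity)
    return backlog

# Hypothetical ticket-handling process.
stages = [("intake", 120), ("triage", 60), ("resolution", 80), ("closure", 50)]
bottlenecks = find_bottlenecks(100, stages)
```

With these sample figures, triage accumulates 40 items per hour and closure 10, so triage is the first place where adding capacity or reducing the load would help.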

A few pointers for reducing a bottleneck include:

Reduce the strain on the bottleneck
Organize similar work items in batches
Add more people or resources to increase the capacity and speed up the work
Redesign the critical online and batch components.

Documenting Critical Business Processes

Critical business functions are business processes that must be restored in the event of a disruption in order to protect the organization’s assets and meet the needs of the organization and the business. They include processes that are vital to business functions and to the operation of the company, processes that are in direct contact with the customer, and processes that carry great risk if not handled properly. It is very important to take care of these processes in order to deliver the key products and services that enable an organization to meet its objectives. Preparing end-to-end process flows with detailed checklists for critical business processes helps to improve client satisfaction and reduce escalations.